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ABSTRACT 

qq ' We describe techniques to characterise the light-curves of regular variable stars by 

, applying principal component analysis (PCA) to a training set of high quality data, 

and to fit the resulting light-curve templates to sparse and noisy photometry to obtain 
parameters such as periods, mean magnitudes etc. The PCA approach allows us to 
efficiently represent the multi-band light-curve shapes of each variable, and hence 
quantitatively describe the average behaviour of the sample as a smoothly varying 
function of period, and also the range of variation around this average. 
. In this paper we focus particularly on the utility of such methods for analysing 

i' HST Cepheid photometry, and present simulations which illustrate the advantages 

' \ of our PCA template-fitting approach. These are: accurate parameter determination, 

including light-curve shape information; simultaneous fitting to multiple passbands; 
' quantitative error analysis; objective rejection of variables with non Cepheid-like light- 

ed \ curves or those with potential period aliases. 

!• ■ We also use PCA to confirm that Cepheid light-curve shapes are systematically 

. £^ ' different (at the same period) between the Milky Way (MW) and the Large and 

, Small Magellanic Clouds (LMC, SMC), and consider whether light-curve shape might 

5_i ■ therefore be used to estimate the mean metallicities of Cepheid samples, thus allowing 

metallicity corrections to be applied to derived distance estimates. 

Key words: Cepheids, variable stars - general 



1 INTRODUCTION 

Many astrophysical investigations rely on the determination 
of parameters of periodic variable stars. Notably, the use of 
Cepheid variables as distance indicators requires estimation 
of periods and (usually) intensity-mean magnitudes in or- 
der to establish a period-apparent luminosity relation. With 
sparse and noisy data this is hard to do reliably. Given the 
large investment of HST time in observations of Cepheids in 
nearby galaxies (eg. Freedman et al. 2001; Saha et al. 2001; 
Tanvir et al. 1995) , it is particularly important for the tech- 
niques employed to be as accurate and efficient as possible. 

A number of algorithms have been developed to objec- 
tively estimate variable star parameters. Notably the "string 
length" method of Lafler and Kinman (1965), which essen- 
tially minimises square magnitude differences between suc- 
cessive phased data-points, is still frequently used to deter- 
mine periods. This method works well, especially with pre- 



cise and well-sampled data, but is likely to be less secure 
with data "at the limit" - i.e. close to the limiting appar- 
ent magnitude of the photometry and/or with sparse phase 
coverage. To find intensity-mean magnitudes, many authors 
use the phase- weighted method suggested by Saha & Hoes- 
sel (1990), which makes allowance for the non-uniform sam- 
pling of the light-curve in time. Again, this works well with 
good data, but is potentially inefficient (in the sense of not 
making full use of all the data) with sparsely-sampled data. 

Most HST studies have gone one step further in using 
the shape of the light-curve in the V band to predict its form 
in the / band, and hence to allow the I band intensity-mean 
magnitude to be estimated from only a very few photometric 
data-points. The motivation for this approach is to provide 
colour information relatively cheaply, which is required to 
estimate - and then correct for - reddening by dust. 

The simplest such recipe (Freedman 1988) uses only 
prior knowledge of the typical ratio of V to / band ampli- 
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tude and the typical phase shift between V and / bands at 
maximum-light for Cepheids. With this model, a correction 
can be made to the I mean photometry, assuming that it is 
the same as the correction which would have to be applied 
to an equivalently undersampled V light curve, multiplied 
by the adopted ratio of amplitudes. Obviously, as with other 
similar methods, errors are introduced here, both those de- 
pendent on the V and / photometric quality (or lack thereof) 
and possibly also the accuracy of the prior information. Sub- 
sequently a rather more sophisticated algorithm was devel- 
oped by Labhardt, Sandage & Tammann 1997. This involves 
predicting and fitting a template light-curve in the I band 
based on the parameters (i.e. the period, phase, amplitude 
and shape) already determined from the V band data. The 
strong correlations between the light-curves of Cepheids in 
different bands make this a productive approach. 

Fitting template light curves as a means of estimating 
Cepheid parameters was first introduced by Stetson (1996) 
who used templates based on Fourier decomposition of a 
set of well-observed MW and LMC/SMC Cepheids. In his 
method, initial values of plausible periods are determined by 
string-length analysis, and then templates fitted with each of 
these periods as a starting point, and the overall amplitude 
left as a free parameter (in addition to the period, phase, and 
mean magnitudes) . A scoring system is then used to identify 
the most plausible fit. Stetson argued that the advantage of 
automated classification of variables and determination of 
their parameters is not so much that a computer algorithm 
will necessarily do better than an experienced human ana- 
lyst, but that the biases and systematics can be more easily 
studied and characterised. 

A further refinement to the Fourier-fitting method was 
presented in Ngeow et al. (2003), where "simulated anneal- 
ing" is used to improve the quality of the Fourier decom- 
position of sparsely-sampled HST V band light-curves. This 
technique restricts the allowed range for the Fourier am- 
plitudes in the minimisation procedure, and thus performs 
substantially better than conventional least-squares fitting 
on data with significant gaps in phase coverage. / band light 
curves are reconstructed from the V band using interrela- 
tions of the Fourier coefficients. 

A new approach to Cepheid light-curve template gener- 
ation was introduced by Tanvir et al. (1999; see also Hendry 
et al. 1999) who used Principal Component Analysis (PCA) 
to statistically characterise a training set of MW, LMC and 
SMC Cepheids, and fitted these templates to V and I data 
for HST observed Cepheids in M96. By fitting well-defined 
and realistic template curves, several parameters can be de- 
termined, together with estimates of their uncertainties. One 
of the very attractive features of this technique is that pho- 
tometry in different bands can be handled simultaneously, 
so that the natural correlations between bands are auto- 
matically built into the templates and all of the data is used 
to determine the parameters. Kanbur et al. (2002) described 
the PCA method in more detail, used it to investigate varia- 
tion in Cepheid light-curve structure as a function of period, 
and described the error properties of the PCA coefficients. 
PCA template-fitting was also successfully applied to HST- 
observed Cepheids in NGC1637 by Leonard et al. (2003). 

In this paper we provide a complete description of the 
PCA-based method of characterising light-curves, present an 
updated training set and consider in detail the subsequent 



template-fitting algorithm which was used in Tanvir et al. 
(1999) and Leonard et al. (2003). We describe simulations 
which illustrate the potential of the methods, and discuss 
future directions. Although we focus on their application to 
V and / band Cepheid data, these techniques may easily be 
extended to other passbands and also used to analyse other 
classes of periodic variable stars. For example, Kanbur and 
Mariani (2004) consider PCA of photometric data for RRab 
stars. 

The structure of the paper is as follows. In Section 2 
and 3 we present our training set of well-observed MW and 
LMC Cepheids and describe in more detail how PCA is ap- 
plied in order to define the template light-curves, and the 
advantages of this approach over other methods. In section 4 
we discuss an algorithm for fitting templates to noisy data in 
order to estimate light-curve parameters and their errors. Of 
course, such a procedure is required however the templates 
are generated, but in our case the fitting process also returns 
estimates for the coefficients of the first two principal com- 
ponents. In Section 5 we then go on to generate simulations 
of poorly-sampled Cepheids with noisy photometry, mim- 
icking "typical" and "difficult" HST data sets, and extract 
their parameters by template-fitting. We consider distance 
determination and light-curve parameter estimation using 
both mean- and maximum-light estimates. This serves to 
illustrate how our method performs in practice compared 
to other methods, and also allows us to explore the limits 
of HST-like data-sets. In section 6 we introduce an SMC 
Cepheid sample, and consider the question of whether light- 
curve shape, for either individual Cepheids or averaged over 
populations, contains other useful information - in particu- 
lar its potential as an indicator of metallicity. Our conclu- 
sions are given in section 7. 



2 PRINCIPAL COMPONENT ANALYSIS OF 
OUR TRAINING SET 

PCA is a widely used statistical tool and has been applied 
in recent years to a number of astrophysical problems, such 
as spectral classification, photometric redshift determina- 
tion and morphological analysis of galaxy surveys (eg. Li, 
Kong and Cheng, 2001). For a detailed account of the sta- 
tistical basis of PCA the reader is referred to e.g. Morrison 
(1967). The central principle behind PCA is easily stated, 
however: it provides a means of transforming a multidimen- 
sional dataset consisting of a number of statistically depen- 
dent variables into a set of statistically independent vari- 
ables, which are the principal components. Specifically, the 
first principal component is determined to be the linear com- 
bination of the original variables which accounts for as much 
of the variability in the data as possible; the second princi- 
pal component is the linear combination which accounts for 
as much of the remaining variability as possible - subject 
to the constraint that it is orthogonal to the first principal 
component - and so on. In many situations the first few 
principal components may explain a high proportion of the 
variability in the data, so that one may substantially reduce 
the number of variables used to describe the data set with 
very little loss of information. 

Our starting point is a calibrating set of 127 Cepheids, 
with periods P > 10 days and high-quality well-sampled V 
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Figure 1. Distribution of log period for our 'training set' of 127 
Cepheids. 



band and / band light-curves. We used this 'training set' 
to establish relationships between multicolour light-curve 
shape (LCS) and period. The training set consists of: 

• 61 Galactic Cepheids with photometry from Berdnikov 
(unpublished data-base) , Berdnikov & Turner (1995), and 
Moffett & Barnes (1984) 

• 66 LMC Cepheids, covering a wider period range 
with photometry primarily from the OGLE catalogue 
of Classical Cepheids (Udalski et al. 1999a; web archive 
at |http://bulg e. princeton. edu/~ogle/ogle2/cep _lmc.html| 
Fourier analysis from Ngeow et al. 2003), but supplemented 
by data taken from various sources, particularly Moffett et 
al. (1998). 

Figure shows the distribution of periods in our sample. 
Note that we do not include any SMC Cepheids in our train- 
ing set; this reflects the fact that the metallicity of target 
galaxies for e.g. HST Cepheid distance estimation is gener- 
ally significantly higher than that of the SMC. We do con- 
sider SMC Cepheids in Section 6, however, in our discussion 
of light-curve shape characterised by PCA as a possible di- 
agnostic of metallicity. 

To apply PCA to this sample, we first Fourier analyse 
the photometric data (for those variables not already anal- 
ysed by Ngeow et al.), up to eighth order, ie. perform a 
least-squares fit of the following: 



m(t) — mo + S ' a?; sin(2-7rfct/P) + bfc cos(2ivkt/P) 



(1) 



The coefficients of the Fourier terms constitute a vector 
consisting of 32 elements for each member of the training set 
(i.e. 8 sine amplitudes and 8 cosine amplitudes for both the 
V and I bands - but note that the phase is shifted such that 
the first cosine term in V is always zero. Note also that the 
mean V and / magnitudes, the do terms, of the calibrating 
Cepheids are not included in the PCA since they are distance 
dependent). 

The mean V and I light-curve shape is established sim- 
ply by averaging these vectors. PCA is then applied to the 
whole set of residual vectors (ie. with the average vector 
subtracted) in order to determine the most significant vari- 
ations from the mean LCS. Full numerical details of the 
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0.827 
0.890 
0.917 



Table 1. Proportion of the total variance in the calibrating set 
explained by the first few principal components. (In PCA the 
variance associated with each component is equal to the corre- 
sponding eigenvalue of the covariance matrix) . 



analysis are given in appendix A. Incorporating both V and 

1 data in each vector means that the correlations between 
the coefficients in each band are automatically encoded in 
the resulting analysis. This could, of course, be extended to 
more bands, but we restrict ourselves to V and I here since 
only those filters have been used in the large majority of 
HST studies. 

In practice, the first principal component largely reflects 
simple variations in amplitude. Subsequent components en- 
code more subtle light-curve shape information, such as 
"bumps". Of course, sets of Fourier amplitudes are not the 
only vectors which could be used as input to the PCA. One 
could, for example, work directly with the observed V and I 
magnitudes for each calibrator, smoothed and interpolated 
onto a regular grid of phase values. We find, however, that 
the use of Fourier components as input vectors works very 
well, naturally incorporating a degree of smoothing of the 
input data and providing a link with previous approaches to 
LCS analysis. 

Table 1 shows for our calibrating set the proportion 
of the variance explained by the first few principal compo- 
nents. We can see from this table that one requires only a 
few components to explain a large proportion: for example, 
the first three principal components account for 89 per cent 
of the variance of LCS within the sample. Moreover, since 
the observed scatter in the data includes the effects of pho- 
tometric errors and finite sampling on the estimated Fourier 
coefficients which are input to the PCA, the proportion of 
the intrinsic variation in LCS explained by the first three 
principal components will, in fact, be even higher than 89 
per cent. 

Figure [5] shows an example of two Cepheids from the 
OGLE data-set with very good phase coverage. This illus- 
trates that excellent light-curves are reconstructed from just 

2 PCA terms. Since these reconstructed curves incorporate 
information from the whole training set, they necessarily re- 
flect average Cepheid behaviour, and don't fit perfectly any 
individual Cepheid. However, this has the advantage that 
they don't follow noise in the data either, as the Fourier fits 
in the V band are beginning to do in these examples. 



3 DERIVING TEMPLATE LIGHT-CURVES 

With the PCA coefficients for each variable in hand, we can 
plot them as a function of period, as shown for the first four 
principal components in Figure [3] This figure reveals some 
important trends - notably that the behaviour of the first 
principal component, as expected, is similar to a simple plot 
of amplitude versus period, with a peak at around P = 30 
days (see e.g. Schaltenbrand & Tammann 1972). There is 
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Figure 2. Light-curves for two LMC Cepheids of different periods (OGLE data). The Fourier fits are shown as a grey line (ie. 32 terms 
describing both V and and the reconstructed PCA curves with 1, 2, 3 and 4 terms are shown in successively thicker, black lines. We 
emphasise that V and / are simultaneously described by the PCA curves since the analysis is performed on the combined data-set. Note 
that the fits, whilst not perfect, are very good, and that in fact beyond 2 terms further changes in the PCA curves are almost entirely 
within the thickness of the line. 



clearly also systematic structure in the plots for the coef- 
ficients of the second and third principal components. By 
the fourth component the distribution of coefficients is be- 
coming increasingly dominated by noise, although small but 
statistically significant correlations of the coefficients with 
log(P) are seen up to at least 8 PCA terms. 

We have fitted low-order polynomials through these 
scatter plots, to define "typical" values of the PCA coef- 
ficients for a Cepheid of given period, and also obtain some 
estimate of the spread around these typical values. The poly- 
nomial fits are shown by the solid curves in each of the pan- 
els of Figure^] terminating at log(P/days) = f .8 where the 
number of training Cepheids becomes very few. 

Before discussing the construction of template light- 
curves, it is interesting to compare the distribution of co- 
efficients for the Milky Way and LMC subsamples. For the 
first and third principal components we can see from Figure 
|H]that the distributions appear to be well mixed, with no ob- 
vious distinction between the two samples. This is broadly 
consistent with the results of Kanbur and Ngeow (2004), who 
obtained period-colour and amplitude-colour relations for a 
similar sample of MW and LMC OGLE Cepheids. While 
those authors found strong evidence for a difference in the 
slope of these relations between long- (i.e. P > 1 days) and 
short-period Cepheids, they did not find a statistically sig- 
nificant difference in the long-period sample slopes between 



the LMC and MW. For the second principal component, 
on the other hand, at a given period the LMC coefficients 
typically appear to be less than those for the Milky Way. 
The distributions are clearly not disjoint, so that the adop- 
tion of a single polynomial fit to describe the structure in 
the combined training set is reasonable. Nevertheless, Fig- 
urc|2]suggests that the second principal component, at least, 
might be a useful discriminator between different Cepheid 
samples. We discuss this point further in Section 6 below. 

In order to generate a realistic Cepheid light-curve tem- 
plate all that is now required is to read off the PCA coeffi- 
cients corresponding to the desired period according to the 
polynomial fits in Figure|3] and hence, with knowledge of the 
PCA vectors and the average light-curves, reconstruct a full 
sequence of Fourier terms. We emphasise again that because 
V and / Fourier coefficients are both included in each vector, 
the light-curves in each band are reconstructed simultane- 
ously. An extra degree of sophistication can be achieved by 
considering the PCA coefficients within a range around the 
polynomial fit corresponding to the scatter in the training 
set. In this case we find not a single template at a given 
period, but a whole family of allowable light-curves. 

Recalling that the primary motivation of the present 
study is to extract optimal light-curve parameters from ob- 
served noisy data-sets, the next challenge is to find the best 
fitting template light-curves to such a data-set when the 
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Figure 3. PCA coefficients plotted as a function of log period in days, for our training set of 127 Milky Way (light stars) and LMC 
Cepheids (darker squares). The four panels illustrate the diminishing strength of successive principal components, and by plotting against 
log(P) also reveal the systematic trends in light curve amplitude and shape with period. Low order polynomial fits are overplotted which 
will be used to obtain typical coefficients at a given period. The small but real difference between the distributions of LMC and MW 
points is discussed further in the text. 
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period and other parameters are a priori unknown. Our so- 
lution to this problem is described in more detail in the next 
section. 



4 LIGHT-CURVE PARAMETER ESTIMATION 
VIA TEMPLATE-FITTING 

To characterise the Cepheid light-curves of sparsely- 
sampled, noisy data our approach is to find the best-fit PCA 
templates, simultaneously in V and I, determined from a 
search of the plausible parameter space in period, V and I 
mean magnitude, phase and PCA coefficients. Determining 
period, phase etc. using both bands is unusual, but makes 
most efficient use of all the data. 

In practice - mainly for computational expediency - 
only the first two PCA coefficients are allowed to vary. The 
higher PCA coefficients are set to zero, but as seen above 
(Figure using more than two PCA terms only modifies 
the light-curves at a subtle level. 

The fitting procedure for an individual Cepheid is sum- 
marised as follows. 

(i) Loop over a large range of trial periods 
(usually between 10 and 65 days with steps of 
0.001 in the log) 

(ii) for each trial period, loop over a range 
of PCI and PC2 coefficients around their typical 
values for that period, thus generating template 
light-curves in V and I for each pair of values of 
the PCI and PC2 coefficients. 

(iii) for each pair of templates, find the values 
of phase and intensity-mean magnitudes in V and 

I which minimise the y 2 statistic, where the \ 2 
is defined as the sum of the squared deviations 
(normalised by the photometric errors) between the 
observed V and I data-points and the magnitudes 
predicted by the templates . This procedure 
utilizes the Amoeba algorithm (eg. Press et al . 
1992) . 

(iv) move to next pair of PCI and PC2 
coefficients . 

(v) move to next trial period. 

At the end of this process, the trial period and V 
and I light-curves with the overall lowest \ 2 i- s 
then assumed to provide the best estimates of all 
the parameters . 

In practice we plot exp(— x 2 re d/2), as an indicator of rel- 
ative likelihood, against test period. This plot reveals: firstly, 
the best period (the peak position); secondly, the goodness 
of the best fit (the height of the peak); thirdly, how well the 
period is determined (the width of the peak) ; and, fourthly, 
whether there are any potential aliases at completely dif- 
ferent periods (essentially the height of the second highest 
peak) . As we show later, this information can be very useful 
in deciding objectively which variables to include and which 
to exclude in a period-luminosity analysis. 



4.1 Estimating the distance modulus and its error 

To estimate the distance modulus of the Cepheid we now 
ignore the data-points and calculate intensity-mean magni- 
tudes, (V), (I), of the fitted templates themselves. These 
are compared to the absolute magnitudes calculated for the 
given period using the calibrating PL relations. The colour- 
excess is used to correct for extinction following the proce- 
dure detailed by Tanvir (1997), and hence an estimate of 
the true, unreddened distance modulus, fj,o, is made. 

We can also obtain in a straightforward manner an es- 
timate of the uncertainty on fio by computing the posterior 
distribution, p(^o|data), of fio given the observed V band 
and I band data. Note that this procedure accounts for in- 
ternal errors due to photometric noise and finite sampling, 
but not to errors in the original templates (likely to be rel- 
atively small in practice) or calibration errors (which must 
be estimated independently). 

We proceed as follows: Let <j>, P, PCi, PC2 denote re- 
spectively the phase constant, period and the first and sec- 
ond PCA components of the underlying light-curves. Ignor- 
ing the higher order PCA coefficients, and also neglecting 
for the moment the impact on the light-curve shape of other 
stellar parameters such as metallicity (see below), we assume 
that these four parameters, together with the intensity-mean 
magnitudes, (V) and (I), completely specify the V and I 
band light-curves. Note, moreover, that under this approxi- 
mation P is uniquely defined by the values of (V), (I) and 
Ho, so that we need not consider the period as an inde- 
pendent parameter 1 . To simplify notation, we denote the 
remaining light-curve parameters collectively by the column 
vector A; i.e. 

A= (00, (/),«/>, PCI, PC2) T (2) 
Formally, we may then write 

p(/io|data) = J p((iQ, A|data)dA (3) 

i.e. we marginalise p(/io, A|data) over the other independent 
parameters. 

To simplify matters we assume that PCi and PC2 are 
equal to the values determined previously for the glob- 
ally best-fitting template. In practice, we also assume that 
p(/xo, A|data) = unless <j> is equal to its best fit estimate 
for the period determined by the particular values of /Lto, (V) 
and (I) This allows eq.[3]to be rewritten as 

p(^o|data) = J p(fio,A\d&ta)d(V)d(I) (4) 

which, in turn, can be approximated by a sum over a series 
of 'trial' values of (V) and (7). From Bayes' theorem we may 
write 

p((j,o, A|data) = p(data|^o, A)p(/no, A) (5) 

where the first term is the likelihood function, expressing 
the probability of obtaining the observed photometric data, 
given a set of light-curve parameters and a Cepheid at dis- 
tance modulus jtto, and the second term is a prior distri- 
bution for those parameters and for the distance modulus. 

1 In other words we assume that the Cepheid lies exactly on the 
fiducial V and I linear PL relations 
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Assuming a flat prior for p(/xo, A), equation^Jmay be further 
reduced to 

p(/xo|data) oc ^ ^p(data|/i , (V) 3 , (I) k , 4>, PCi, PC 2 ) (6) 

3 fc 

where (V) ■ and (7} fc denote a series of (equally spaced) trial 
values of the mean V and 7 band magnitudes. One can, if 
appropriate, easily generalise equation to the case of a 
non-uniform prior and a non-uniform grid of (V) ■ and (7)^ 
values. 

Assuming that the photometric errors are normally dis- 
tributed, finally we obtain 

p(/u |data) oc exp(-Xj 2 fc ) (7) 

j k 

where Xjk i s the chi-squared obtained from comparing the 
observed and predicted magnitudes, given values of fi, (V) j , 
(7) fc , <f>, PCi and PC2. In fact, we could take as our estimate 
of no the value which maximises the posterior likelihood 
p(/Lto|data) - or equivalently the value which minimises x% 
- however, in practice these values differ from those for the 
individual best-fit template by only 1 or 2 hundredths of 
a magnitude in distance modulus. Instead we use the fact 
that p(/^o|data) should be properly normalised, to compute 
o ^ , the uncertainty in the estimated distance modulus from 
the width of the resulting likelihood function. Notice that 
this analysis gives reasonable estimates of the uncertainty 
providing the estimated period itself is not greatly in error. 
If, in fact, the best fit period is an alias then the true error 
on the distance modulus is likely to be significantly larger. 

The above analysis can be readily extended to include a 
metallicity dependence in the shape of the light-curve tem- 
plates and in the PL relations. In this case the period, P, 
would be determined by the values of fi, (V), (7) and metal- 
licity, Z, which itself might be estimated along with the 
other independent parameters by the template-fitting ap- 
proach. Such an extension in practice seemed inappropri- 
ate for the training set considered in this paper, since it 
contained Cepheids from different metallicity environments 
(but see Section 6, below). The study of metallicity effects 
using PCA is straightforward in principle, however (see Kan- 
bur et al. 2002). Moreover, one could extend the model for 
the prior distribution, p(fio, A), of light-curve parameters to 
include a dependence on other fundamental stellar parame- 
ters, such as effective temperature and mass, reflecting one's 
state of knowledge about e.g. the width of the instability 
strip, initial mass function and mass luminosity relation for 
Cepheids (see, for example, Kochanek 1997 for an example 
of such a prior model). 



5 SIMULATIONS OF SPARSE AND NOISY 
LIGHT-CURVES 

Another use of our template light-curves is to provide the 
underlying models for production of artificial Cepheid pho- 
tometry. In this section we describe simulations designed 
to resemble the sparse and noisy photometry from typical 
HST Cepheid monitoring campaigns. We then run the PCA 
template-fitting program on these data-sets to obtain max- 
imum likelihood estimates of the parameters for each simu- 
lated Cepheid, and hence establish how accurately the input 



parameters are recovered. We also compare the template- 
fitting results to those obtained from more traditional pa- 
rameter estimation methods. Although the simulations are 
realistic, they are not designed to replicate specific cases of 
HST-observed Cepheids in external galaxies but rather to 
represent generic examples similar to those found by most 
HST studies. Moreover, we have not compared our template- 
fitting method with all algorithms which have been used to 
determine Cepheid light-curve parameters - partly because 
many such studies have involved some degree of subjectiv- 
ity, for example in selecting the Cepheids themselves, which 
is hard to replicate. 

We chose to simulate data for Cepheids of 
log(P/days) = 1.4 (about 25 days), being typical of 
the variables observed in the HST programs, and in the 
middle of the range for which the sampling strategy is 
optimised. Phases were random, and PCA coefficients 
chosen to be as those for real Cepheids (ie. based on the 
polynomial fit to the training set) at the period in question. 
The sampling of the light curves was carried out using a 
particular sequence of observations based on that adopted 
by the HST 77o Key-Project group Specifically this meant 
12 epochs of observation in V and 4 epochs in I. Realistic, 
magnitude-dependent photometric noise was added to the 
data-points, with one set of 400 simulations with error bars 
on each point around 0.1 to 0.2 mag, being representative 
of "typical" HST data, and another set of 1000 simulations 
representing "difficult", low signal-to-noise ratio (S/N) data 
with error bars more like 0.15 to 0.3 mag per data-point. 
The latter set approximates the worst-case data which have 
been obtained in some HST studies. 

5.1 Results of simulations 

To provide a benchmark we first analysed each simulated 
data-set with the methods of period finding via string- 
length minimisation (Lafler and Kinman 1965) and phase- 
weighted intensity-mean magnitude estimation (Saha and 
Hoessel 1990). We then analysed the same synthetic data 
using our PCA template approach described above. For the 
sake of brevity we henceforth refer to these as the "OLD" 
and "NEW" algorithms respectively, whilst clearly recognis- 
ing these particular OLD methods are by no means the only 
ones used in previous studies. They do, however, have the 
benefit of being easily and mechanically applied to the data. 

We consider first the simulations of "typical", moder- 
ate S/N data. Examples of the simulations and period de- 
termination are given in Figure |1] for the OLD algorithms, 
and Figure 03 for the PCA template-fitting algorithm. Both 
approaches work well in this example, in the sense of cor- 
rectly identifying the period. The fact that subtle light-curve 
shape information is largely erased by this degree of sparse- 
sampling and noise addition means that the use of the same 
templates for both creating and fitting to the simulated data 
should not significantly bias the results. We checked this 
expectation by running further noisy simulations but this 
time starting with the observed light curves of individual 
Cepheids from the training set which were outliers from the 
curves in Figure^] The results and level of improvement with 
template-fitting were qualitatively similar to those reported 
above, with the proviso that both algorithms (especially the 
OLD ones) do somewhat better when the overall Cepheid 
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Figure 4. Example of artificial Cepheid photometry simulated 
to have S/N typical of HST data and analysed with the OLD 
algorithms. The left-hand panel shows the Lafier-Kinman (1965) 
string-length measure applied just to the V band data-points and 
plotted as a function of trial period - the best period being in- 
dicated by the minimum string length. Since the input period is 
log(P/days) = 1.4, the method obviously works well for this sim- 
ulation. The worst aliases tend to occur at half the true period 
(ie. offset by about 0.3 in the log), but in this case (and in fact 
for nearly all the simulations with this S/N) the worst alias is 
not as good a fit as is the correct period. The right-hand panels 
show the data-points folded on this best period. Note that the 
zero-points for the magnitude scales are chosen arbitrarily. 




1.2 1.4 

log(P/ days) 



Folded epoch (days) 



Figure 5. The same simulated data shown in Figure |41 analysed 
with the NEW, template-fitting algorithm. In this case we look for 
a peak in the relative likelihood curve (left-hand panel) to identify 
the best fitting period (see description in text). The period found 
by the algorithm is very close to that input in the simulation, 
log(P/days)=1.4, and again a small alias is seen to appear near 
P/2. The right-hand panels again show the data folded on this 
best period. 



amplitude is larger rather than smaller. In other words, as 
one would expect, the outliers with PCI above the average 
are easier to find periods for than the outliers with PCI 
below the average. 

In Figure [5] we summarise the results of all 400 sim- 
ulations in histograms showing the returned periods and 
inferred distance moduli. The extinction-corrected distance 
moduli are calculated using P, (V) and (I) as described by 
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Figure 6. Histograms of the results from 400 simulations with 
"typical" HST S/N. The input values are indicated by vertical 
dotted lines, namely log(P/days) = 1.4 and /iq = 30. The latter 
is chosen arbitrarily as being typical of HST studied galaxies. 
The upper panels are for template-fitting and the lower panels 
using the OLD algorithms. Both period and distance modulus 
are well determined with this S/N, as indicated by the solid dot 
and bar which represent the mean and standard deviation of each 
distribution (numerical values are printed next to the bar). The 
number of occasions where an alias period is wrongly identified 
as the true period is negligible. 



Tanvir et al. (1997), which is essentially the same method 
as used by the HST Ho Key-Project group. Neither method 
is confused by aliases with this S/N data, although the 
template-fitting produces rather more accurate periods and 
distance moduli. Specifically the la rms scatters for the 
NEW and OLD methods are 0.16 and 0.23 mags respec- 
tively, suggesting that the uncertainties in distance modulus 
resulting from light-curve parameter estimation for samples 
of several tens of Cepheids should only be a few hundredths 
of a magnitude. 

For each fit to the simulated data-sets we also evaluated 
the uncertainty in distance modulus as described in section 
4.1. The average turned out to be 0.15 mag, very close to the 
observed scatter, giving confidence that the errors returned 
by the template-fitting procedure itself are realistic. 

The situation with the "difficult", low S/N data is il- 
lustrated in Figure [7| for the OLD algorithms and Figure 
|H]for the NEW template-fitting ones. In this case we show 
four examples to highlight the fact that now results range 
from cases where the input parameters are well recovered, 
to instances where the best fit is actually obtained with an 
alias period. 

This behaviour is summarised in Figure[2]for 1000 sim- 
ulated data-sets which shows that now both techniques pro- 
duce the occasional period aliases. In general terms the pe- 
riods and distance moduli show more scatter than was the 
case with higher S/N, although overall an accuracy of about 
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Figure 7. Examples of "difficult", low S/N simulated data analysed with the OLD algorithms. The panels are similar to those in Figure 
l4l The four cases were chosen to illustrate a range in behaviour, with the top left and bottom right variables suffering from bad aliases 
at half the true period. Note, in the latter case the folded data-points do not trace a very Cepheid-like light-curve. 



0.4 mag per Cepheid in distance modulus is still reasonably 
good. Once again the template-fitting performs a little bet- 
ter than the OLD algorithms, but both reveal a slight bias to 
lower distance moduli, over and above the increased scatter. 

However, we must be cautious in interpreting these re- 
sults for a number of reasons. In practice, because of the 
low S/N, some of our simulated variables would probably 
not have been classified as variables in the first place had 
they appeared in an HST study. Furthermore, bad fits would 
often be rejected as not being sufficiently "Cepheid-like" . It 
wouldn't be surprising if such cases of bad template fits also 
produced the most discrepant distance moduli. 

In order to assess these effects, and also make the test 
more realistic, we clipped the sample of simulations to ex- 
clude those for which the degree of scatter of the data-points 
about a constant, non-variable line was such that it would 
only occur by chance 1 time in 5000. In other words we 
insisted (as do most Cepheid studies one way or another), 
that the threshold for treating a star as a variable is high 
enough that very few non-variable stars ever exceed it by 
chance. Further we set an upper limit to the acceptable re- 
duced x 2 rc d f° r the template fit of 1.3, which means that 22 
per cent of true Cepheids will be lost, but ensures that only 
those which fold to produce genuine "Cepheid-like" light 
curves are retained. Finally, we rejected any variables for 
which there was an alias period with a x 2 rc d < 1-5 which 



was separated by less than 0.1 in log(P) from the highest 
likelihood peak. This procedure loses a few fits which pro- 
duced accurate periods, but is particularly good at removing 
probable aliases. 

The results of this whole clipping procedure are shown 
in Figure HT71 As expected, the scatter in the results of both 
methods is reduced, but in particular problems with aliases 
for the template-fitting are largely removed, as is the bias in 
distance modulus determination. 

Viewed as a whole the simulations permit the following 
conclusions: 

• Template-fitting results in a roughly 30 per cent re- 
duction in scatter in estimates of distance modulus for 
the "typical" S/N data compared to the OLD methods of 
string-length minimisation to determine periods and phase- 
weighted averaging to obtain intensity-mean magnitudes. 

• There is a tendency to slightly underestimate periods 
for "difficult", low S/N data. Again template-fitting does 
somewhat better than string- length. This small bias in pe- 
riod also leads to a small bias in distance modulus. In 
fact, we have also performed simulations for longer period 
log(P) = 1.7 Cepheids (ie. around 50 days), although not re- 
ported here in detail, and find this underestimation becomes 
a little worse. This is not surprising since the period is now 
approaching the total length of the observing sequence, and 
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Figure 8. The same simulated data as Figure [7| analysed with the NEW, template-fitting algorithms. Again the top left case hits the 
same problem with an alias period providing a better fit that the true period. However, in the bottom-right case this time, the fact that 
we are fitting a template rather than simply minimising string-length has correctly identified the period. 



is only a little below the maximum period employed in the 
trials. 

• Both approaches perform respectably even for low S/N 
data, but with a increasing incidence of period aliases. How- 
ever, not only does the template-fitting do somewhat better, 
it also computes a goodness-of-fit (x 2 ) for the template fit, 
providing an objective way of selecting the variables for in- 
clusion in the analysis. A sample culled on the basis of only 
including good template fits reduces the scatter by a factor 
of around 2 compared to the traditional OLD methods. 

• The distance modulus uncertainties calculated as de- 
scribed in section 4.1 are on average very good. For example, 
for the 25 day period simulations the average uncertainty 
calculated for the typical S/N case was 0.15 mag compared 
to the actual dispersion around the true (input) value of 0.16 
mag. For the low S/N data the numbers are 0.28 mag for 
the estimated errors compared to 0.31 mag as the dispersion 
for the clipped sample of simulations. 

While these are interesting results, and establish the 
utility of the PCA template-fitting method, we caution that 
they do not imply significant problems with the results ob- 
tained by the various groups reporting HST Cepheid obser- 
vations in the past. For one thing, for the typical S/N pho- 
tometry, the NEW algorithm only performed a little better 
than the OLD ones. Furthermore, most recent studies have 
not simply adopted the OLD methods considered above, but 



have also either applied "chi- by-eye" rejection of doubtful 
variables, and/or performed other variants on the template- 
fitting scheme. The results of our simulations reinforce the 
conclusion that such template-fitting has many benefits, but 
we have also shown that our procedure has a more rigorous 
statistical basis and provides a more efficient means of en- 
coding relevant light-curve shape information than previous 
methods. 



5.2 Estimating Maximum-Light from Template 
Fits 

Various authors have suggested that Cepheids at maximum- 
light may be as good, if not better, standard candles than 
Cepheids at intensity-mean-light (e.g. Sandage and Tam- 
mann 1968; Kanbur and Hendry 1996; Kanbur et al. 2003). 
Aside from arguments based on intrinsic physical proper- 
ties, another advantage could be that maximum-light may 
be more precisely determined than mean-light if the Cepheid 
is faint and hence poorly observed through minimum. 

However, maximum-light has rarely been used in prac- 
tice, perhaps partly because many epochs are required to 
give a decent chance of sampling close to the maximum. One 
also loses some of the benefit of averaging many observations 
to reduce noise. Obtaining estimates of maximum-light from 
template fits may be a way of benefiting from the advan- 
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Figure 9. Histograms of the results from 1000 simulations of "dif- 
ficult", low S/N data. Compared to Figure|H]we see that the dis- 
tribution of returned periods and inferred distance-moduli shows 
more scatter, a small but non-negligible contamination by aliased 
periods and a small but statistically significant bias toward lower 
values of /iq. 



tages while not suffering the disadvantages. All the data is 
used, with appropriate weighting, and reasonable estimates 
of maximum-light can be obtained without requiring dense 
sampling over the maximum itself. 

We have also tested our ability to estimate maximum- 
light using the template fits to the simulated data. The re- 
sulting histograms are actually so similar to the ones pre- 
sented for mean-light that we feel they are not worth show- 
ing separately (mean- light is very marginally better). Ad- 
mittedly here the fact that the same templates are used to 
create the artificial noisy data in the first place and then 
to fit to it, may make the simulation results appear slightly 
rosier than reality. But the point is clear: although we find no 
evidence that maximum-light is superior to mean- light, it is 
certainly reasonable to use maximum-light with template- 
fitting. Interestingly we also find a strong correlation be- 
tween the maximum-light and mean-light distance moduli 
for individual simulated Cepheids, indicating, perhaps un- 
surprisingly, that they encode very similar information and 
can't therefore be combined in any way to provide an im- 
proved distance indicator. 



6 LIGHT-CURVE SHAPE AND METALLICITY 

An obvious question which arises from our analysis thus far 
is whether light-curve shape is sensitive to physical parame- 
ters other than just period. It would be particularly useful, 
for example, if light-curve shape were found to be sensi- 
tive to a property - such as metallicity - which is expected 
to correlate with the absolute magnitude of a Cepheid (see 
e.g. Caputo et al. 2000, and references therein). Quantifying 
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Figure 10. The same information as in Figure l9l but this time 
clipped of the low amplitude variables and those with badly fitting 
light-curves. This process mimics what is often done in practice, 
selecting candidates above some threshold criterion for variability, 
and rejecting those which don't appear "Ccphcid-like" . In addi- 
tion, as described in the text, we have removed variables with 
potential aliases, which is a fairly well-defined, automated proce- 
dure (ie. based on the height of the second highest peak in the 
relative likelihood plot for the variable) when doing template- 
fitting. These steps obviously reduce scatter, remove many of the 
aliases, particularly for the template-fitting, and largely remove 
the small bias. The price which is paid is the loss of 12.5 per cent 
of the variables used for the OLD algorithm analysis, and about 
46 per cent of the variables used for the template-fitting. 



the (reddening corrected) sensitivity of Cepheid distances to 
metallicity has proven hard, but estimates have tended to 
be in the region of 0.2 mags in distance modulus (~10 per 
cent in distance) for a factor 10 in metallicity (eg. Sakai et 
al. 2004). In most HST studies, the Cepheid metallicity is 
estimated from spectroscopy of the gas phase - sometimes 
from nebulae in other parts of the galaxy from the Cepheids. 
Directly constraining the metallicity of the Cepheid sample 
itself via observations of light-curve shape would have obvi- 
ous advantages over this approach. 

In fact, Paczynski and Pindor (2000) already pointed 
out that OGLE observed Cepheids in the SMC have sys- 
tematically lower amplitudes than those in the LMC in the 
period range 1.1 < log(P) < 1.4, which they suggest is likely 
to be a metallicity effect. Similarly, Kanbur et al. (2002) 
presented evidence suggesting a difference in the average 
light-curve shape of SMC and LMC first-overtone Cepheids, 
based on a principal component analysis of densely sampled 
V and / Cepheid light-curves from the OGLE (Udalski et 
al. 1999a,b) and EROS (Beaulieu et al. 1995) microlensing 
surveys. Figure 12 of Kanbur et al. shows, as a function of 
period, the coefficients of the first and second principal com- 
ponents for OGLE first-overtone Cepheids with periods less 
than 10 days. A comparison from that figure of the PCA 
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Figure 11. PCA coefficients plotted as a function of log period, for our training set of Milky Way (light stars), LMC (darker squares), 
extended by the addition of a sample of SMC (dark triangles) Cepheids. The PCA basis used is that of Figure [3] (ie. the SMC data 
have been decomposed onto this basis). Separate low-order polynomial fits are shown for each set of points, illustrating the systematic 
differences in light-curve shape from galaxy to galaxy, presumably due to metallicity differences. The large error bars show the spread 
in derived PCA coefficients expected from template-fitting to "typical" noisy data, as estimated from our simulations. This indicates 
that photometry for individual HST-observed Cepheids will not usually be good enough to quantify metallicity, but averaging together 
a reasonable sample for a particular galaxy might be. 
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coefficient distribution for LMC and SMC Cepheids shows 
some clear differences - most notably that the distribution of 
first principal component coefficients for the SMC Cepheids 
generally lies below that for the LMC Cepheids. Hence the 
PCA approach picks out changes in the structure of overtone 
light-curves which correlate with metallicity. 

Figure 1111 shows the distribution, as a function of log 
period, of the first four principal component coefficients of 
the same MW and LMC Cepheids as Figure|3]together with 
a further 52 Cepheids from the SMC. The data for the latter 
are largely from the OGLE database (Udalski et al. 1999b) 
and MofTett et al. (1998). Note that the SMC light-curves 
have been decomposed onto the PCA basis established from 
the MW/LMC data rather than being included in an ex- 
panded training set. 

Addition of the SMC data clearly increases the spread 
at a given period, but most of this appears to be systematic 
rather than increased random scatter. In particular, the PCI 
coefficients are generally lower than those of the MW and 
LMC, confirming the reduced amplitude noted by Paczynski 
and Pindor (2000). Particularly prominent are a group of five 
SMC Cepheids with log(()P/days) ~ 1.5, although these 
Cepheids are not significant outliers in the PC2 coefficient 
distribution. 

For the PC2 coefficients, the SMC points are again gen- 
erally lower, as is most clearly seen by looking at the separate 
polynomial fits for each galaxy. Of course, given our partic- 
ular focus on HST Cepheid studies, our training set consists 
only of fundamental mode Cepheids with period, P > 10 
days. Thus Figure [Til is not directly comparable to Figure 
12 of Kanbur et al. (2002). Unlike the plots we presented 
for the MW and LMC alone (Figure [3J we now also see that 
PC3 and PC4 coefficients for the SMC tend to follow the 
upper envelopes of distributions. 

The metallicities of LMC and SMC Cepheids are 
thought to be about 50 per cent and 20 per cent of the 
MW Cepheids respectively. If we assume that the observed 
differences in PCA coefficients are due to metallicity then 
it may provide a route to estimating the average metallici- 
ties of other samples of Cepheids directly. The feasibility of 
this is indicated by the bold vertical bars in the top panels 
of Figure [TTI These show, at three different values of logP, 
the la spread in returned PCI and PC2 coefficients for the 
simulated "typical S/N" data. (Given the typical V and I 
band sampling of HST Cepheid observations, it would be 
unrealistic to extract reliable light-curve shape information 
from the third, or higher, principal components.) It is appar- 
ent that for individual Cepheids only very weak constraints 
can be placed with this quality of data. However, with better 
data or by averaging a reasonable sample of Cepheids, a use- 
ful, direct diagnostic of metallicity may well be achievable. 
We intend to investigate further the dependence of PCA on 
metallicity in a future paper. 



7 CONCLUSIONS 

We have presented in some detail our techniques to charac- 
terise Cepheid light-curves using principal component anal- 
ysis of the Fourier coefficients for a set of well-observed 
Cepheids in the LMC and MW. We have also described how 
light-curve parameters can be extracted by fitting these tem- 



plates to sparse and noisy data, and illustrated the method 
with extensive simulations. 

The advantages of this approach are (i) very realistic 
light curves as a (smooth) function of period are obtained 
with only one or two principal components - in fact they 
are frequently better than the full Fourier fits to the cali- 
brating data since averaging over the full set of Cepheids re- 
moves some numerical noise; (ii) multicolour data can be ac- 
commodated with a single combined fit which automatically 
accounts for the correlations between bands; (iii) template 
fitting to all data (weighted by the errors on each measure- 
ment) makes optimal use of the information in determining 
light-curve parameters; (iv) variables with poor fits (which 
might be produced by non-Cepheids or those whose pho- 
tometry is badly affected by crowding) and potential pe- 
riod aliases are easily identified, and hence can be removed 
from consideration by applying objective, statistical criteria 
(rather than, for example, by visual inspection, as has of- 
ten been the case in the past); (v) errors can be estimated 
in a moderately rigorous way, and Cepheids can be selected 
on the basis of goodness-of-fit; and (vi) maximum-light is 
straight forward to estimate and can be used as an alterna- 
tive to mean-light in the PL relation. 

The simulations themselves show that most Cepheids 
observed in HST campaigns should individually give dis- 
tance moduli to about 0.2 mag (with the template fitting 
doing somewhat better than less sophisticated approaches), 
indicating that for typical sample sizes (several tens of vari- 
ables), random errors in the derived Cepheid parameters 
should only be at the few per cent level for the sample as a 
whole, and systematics are likely to be the dominant source 
of uncertainty. Interestingly we have found that even with 
very poor S/N data, errors can be as little as 0.3 mag for 
individual Cepheids using template fitting. 

Finally we note that these methods can easily be ex- 
tended to photometry in more than two bands, and to the 
analysis of other kinds of periodic variable stars. 
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Each PC vector (and indeed the "average" light curve 
vector) are simply a sum of sine/cosine terms, the coeffi- 
cients for which are given in Table Al. 

k=16 

Pj = ^a fc sin(27rfct/T) +/3 fc cos(27rfct/T) (A2) 

In order to generate typical Cepheid light curves at any 
given period, the 'jk coefficients can be obtained from the 
fits to the training-set data shown in Figure 3. 

7fc = ^A fc (log(P)-1.4) fe (A3) 

k 

where period, P, is in days. 

The coefficients for these equations are given in Table 
A2, and similarly in Table A3 for the polynomial fits shown 
in Figure 11. 



APPENDIX A: RECONSTRUCTION OF 
CEPHEID LIGHT CURVES 

Although the primary purpose of this paper is to illus- 
trate the general advantages of template fitting in obtain- 
ing Cepheid parameters, some readers may be interested in 
using the coefficients we have determined, for example to 
generate template light curves for their own use. 

The light curves for all Cepheids in both V and I are 
initially decomposed into Fourier terms (equation 1). The 
principal component analysis allows us to rewrite these light 
curves in terms of the PC vectors Pk and an average Cepheid 
light curve A. 



fc=32 

m(t) = m + A + 7fc p fe 
k=i 



(Al) 
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Table Al. The top row in this table gives the coefficients for the average Cepheid light curve, to be used in conjuction with A2. The 
subsequent rows are the coefficients for the first 10 PC vectors. 
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Table A2. The coefficients determined in equation A3 from a polynomial fit (shown in Figure 3) as a function of log period to the 
first four principal component coefficients, 71, 72, 73, 74, for our training set. The final column gives the rms scatter of the data points 
around the fits. 
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Table A3. As for table A2, but in this case describing the polynomial fits displayed in Fig 11. 



